---
title: "Benford's Law Application and Interpretation"
date: 2024-02-10
categories:
  - Pandas
  - Accounting
---
Benford's Law, also known as the Newcomb-Benford law or the first-digit law, is a surprising observation about the leading digits of numbers in real-world datasets. In many naturally occurring collections of data, smaller leading digits (like 1 and 2) are significantly more common than larger ones (like 8 and 9).
Real-world data often involves growth, multiplication, and comparisons across different scales, so values tend to spread evenly on a logarithmic scale rather than a linear one. On a logarithmic scale the interval from 1 to 2 is much wider than the interval from 9 to 10, and this "scale invariance" creates a natural bias towards smaller leading digits.
Benford's Law can be a quick and powerful tool for detecting anomalies or fraud in data. If a dataset that supposedly reflects real-world activity deviates significantly from Benford's Law, it might contain manipulated or fabricated numbers.
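To make the link between multiplicative growth and small leading digits concrete, here is a minimal simulation sketch. The 5% growth rate, the starting value of 1, and the 2,000 steps are arbitrary assumptions chosen for illustration; they are not taken from the P-Card data.

```python
import numpy as np

# Hypothetical example: a quantity that starts at 1 and grows 5% per step.
# The growth rate and the number of steps are arbitrary illustration choices.
steps = np.arange(2000)
values = 1.05 ** steps

# In scientific notation ("d.dddddde+XX") the first character is the leading digit.
leading = np.array([int(f"{v:e}"[0]) for v in values])

# Observed share of each leading digit 1..9
observed = np.bincount(leading, minlength=10)[1:] / len(values)

# Benford's expected share for each digit
expected = np.log10(1 + 1 / np.arange(1, 10))

for digit, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"{digit}: observed {obs:.3f}, Benford {exp:.3f}")
```

Under these assumptions the observed shares should land close to Benford's values, because each 5% step moves the value by the same distance on a logarithmic scale.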
This notebook uses the DC government's purchase card transaction data. From Open Data DC:
> In an effort to promote transparency and accountability, DC is providing Purchase Card transaction data to let taxpayers know how their tax dollars are being spent. Purchase Card transaction information is updated monthly. The Purchase Card Program Management Office is part of the Office of Contracting and Procurement.
The latest dataset is available at https://opendata.dc.gov/datasets/DCGIS::purchase-card-transactions/about.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
df = pd.read_csv('DC_PCard_Transactions.csv')
df.head(3)
df.info()
The code below extracts the first digit of each amount in the 'TRANSACTION_AMOUNT' column by converting the column to strings and taking the first character.
# keep only amounts of at least 1, which drops negatives, zeros, and amounts with a leading zero (e.g. 0.50)
# then take the first character of each amount and use value_counts to find its frequency
df_benford_first_digit = (
    df.loc[df['TRANSACTION_AMOUNT'] >= 1, 'TRANSACTION_AMOUNT']
    .astype(str).str[0]
    .value_counts()
    .to_frame(name="count")
    .reset_index(names="first_digit")
    .sort_values('first_digit')
)
# convert counts to proportions of all counted transactions
df_benford_first_digit['actual_proportion'] = df_benford_first_digit['count'] / df_benford_first_digit['count'].sum()
df_benford_first_digit
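The same filter-then-slice idea can be tried on a small made-up series to see which rows survive. The sketch below is illustrative only: `first_digit_counts` is a hypothetical helper, the toy amounts are invented, and the `pd.to_numeric` coercion assumes the column might contain non-numeric entries, which may not be true of the real file.

```python
import pandas as pd

def first_digit_counts(amounts: pd.Series) -> pd.DataFrame:
    """Count leading digits of amounts >= 1 (hypothetical helper, not part of the notebook)."""
    numeric = pd.to_numeric(amounts, errors="coerce")  # non-numeric entries become NaN
    kept = numeric[numeric >= 1]  # NaN, negatives, zeros, and values below 1 drop out
    digits = kept.astype(str).str[0].astype(int)
    # reindex so that every digit 1..9 has a row, even when its count is zero
    counts = digits.value_counts().reindex(range(1, 10), fill_value=0)
    return counts.rename_axis("first_digit").reset_index(name="count")

# Invented amounts: one below 1, one negative, and one missing value
toy = pd.Series([12.50, 0.75, -40.00, 187.20, 9.99, 1500.00, None])
result = first_digit_counts(toy)
print(result)
```

Only four of the seven toy amounts are counted. Reindexing over 1 to 9 is a design choice that keeps the table at nine rows, which helps when a digit never appears in a small or filtered dataset.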
Benford's proposed distribution of leading digit frequencies is given by
\begin{equation} P_i=\log _{10}\left(\frac{i+1}{i}\right) ; \quad i \in\{1,2,3, \ldots, 9\}, \end{equation}
where $P_i$ is the probability of finding $i$ as the leading digit in a given number.
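As a quick numerical check of the formula, the short sketch below evaluates $P_i$ for each digit and confirms that the nine probabilities sum to 1. The variable names are illustrative.

```python
import numpy as np

digits = np.arange(1, 10)
benford = np.log10((digits + 1) / digits)  # P_i = log10((i + 1) / i)

for d, p in zip(digits, benford):
    print(f"{d}: {p:.4f}")
# 1: 0.3010
# 2: 0.1761
# 3: 0.1249
# 4: 0.0969
# 5: 0.0792
# 6: 0.0669
# 7: 0.0580
# 8: 0.0512
# 9: 0.0458

# The sum telescopes to log10(10) = 1, so the values form a valid distribution.
print(round(benford.sum(), 10))  # 1.0
```

About 30% of numbers are expected to start with 1, while fewer than 5% are expected to start with 9.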
Create a new column that contains Benford's proposed proportion for each leading digit.
# append a benford_proportion column with Benford's expected proportion for each first digit
df_benford_first_digit['benford_proportion'] = np.log10(1 + 1 / df_benford_first_digit['first_digit'].astype(int))
df_benford_first_digit
fig = px.bar(
    data_frame=df_benford_first_digit,
    x='first_digit',
    y=['actual_proportion', 'benford_proportion'],
    title=(
        '<b>Proportions of Leading Digits for P-Card Transactions</b><br>'
        '<span style="color: #aaa">Compared with Benford\'s Proposed Proportions</span>'
    ),
    labels={
        'first_digit': 'First Digit',
    },
    height=500,
    barmode='group',
    template='simple_white'
)

fig.update_layout(
    font_family='Helvetica, Inter, Arial, sans-serif',
    yaxis_title_text='Proportion',
    yaxis_tickformat=',.0%',
    legend_title=None,
)

# rename the legend entries
fig.data[0].name = 'Actual'
fig.data[1].name = 'Benford'

fig.show()
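A chart shows the gap between actual and expected proportions visually; a numeric summary can complement it. The sketch below is one possible way to quantify the deviation, not something the notebook itself does. It computes a chi-square goodness-of-fit statistic and the mean absolute deviation between observed and Benford proportions. `benford_deviation` is a hypothetical helper, and both sets of counts are made up for illustration.

```python
import numpy as np

def benford_deviation(counts):
    """Return (chi_square, mean_abs_deviation) for counts of leading digits 1..9.

    Hypothetical helper for illustration; `counts` must be ordered by digit.
    """
    counts = np.asarray(counts, dtype=float)
    if counts.shape != (9,):
        raise ValueError("expected nine counts, one per leading digit 1..9")
    total = counts.sum()
    expected_prop = np.log10(1 + 1 / np.arange(1, 10))
    expected = expected_prop * total
    chi_square = ((counts - expected) ** 2 / expected).sum()
    mad = np.abs(counts / total - expected_prop).mean()
    return chi_square, mad

# Made-up counts, not taken from the P-Card data
benford_like = [301, 176, 125, 97, 79, 67, 58, 51, 46]
uniform_like = [111] * 9

chi_b, mad_b = benford_deviation(benford_like)
chi_u, mad_u = benford_deviation(uniform_like)
print(f"Benford-like: chi-square = {chi_b:.2f}, MAD = {mad_b:.4f}")
print(f"Uniform-like: chi-square = {chi_u:.2f}, MAD = {mad_u:.4f}")
```

With nine digit categories there are 8 degrees of freedom, and the standard 5% critical value is about 15.51, so a statistic well above that suggests the digits do not follow Benford's distribution. In the notebook, the helper could be called with `df_benford_first_digit['count']`, provided all nine digits are present and sorted. With very large samples the chi-square test flags even tiny deviations, so the mean absolute deviation is worth reading alongside it.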